[My IODS project GitHub data repository](https://github.com/maivu2054/IODS-project)
# Introduction to Open Data Science 2019
I joined this course to discover the world of data science. Besides that, I would like to make friends with people who work in the health sciences. I am currently a PhD student at the University of Eastern Finland.
## Data wrangling
Prepare the data from the original file. The R script and the data are available in the repository.
First install the required packages: install.packages(c("dplyr", "GGally", "tidyverse", "xlsx", "ggplot2"))
library(dplyr)
library(ggplot2)
library(readxl)
learning2014 <- readxl::read_excel("data/learning2014.xlsx") %>%
  mutate_at(vars(gender), factor)
dim(learning2014)
head(learning2014)
str(learning2014)
The data includes seven variables: gender, age, attitude, deep, stra, surf, points.
- gender: M (male), F (female)
- age: age (in years), derived from the date of birth
- attitude: global attitude toward statistics
- deep: average score of the deep-learning questions
- stra: average score of the strategic-learning questions
- surf: average score of the surface-learning questions
- points: exam points

The full data is available at: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt
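The learning2014.xlsx file read above was prepared in a separate wrangling script. A minimal sketch of that step, assuming the question groupings from the course material (the exact question lists and column names are an assumption here):

```r
library(dplyr)

# Read the full survey data (URL given above)
lrn14 <- read.table("https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                    sep = "\t", header = TRUE)

# Scale the global attitude back to the 1-5 range
lrn14$attitude <- lrn14$Attitude / 10

# Question groups (assumed to follow the course material)
deep_q <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
            "D06", "D15", "D23", "D31")
stra_q <- c("ST01", "ST09", "ST17", "ST25", "ST04", "ST12", "ST20", "ST28")
surf_q <- c("SU02", "SU10", "SU18", "SU26", "SU05", "SU13", "SU21", "SU29",
            "SU08", "SU16", "SU24", "SU32")

# Each combination variable is the row mean of its question group
lrn14$deep <- rowMeans(select(lrn14, one_of(deep_q)))
lrn14$stra <- rowMeans(select(lrn14, one_of(stra_q)))
lrn14$surf <- rowMeans(select(lrn14, one_of(surf_q)))

# Keep the analysis variables and drop students with zero exam points
learning2014 <- lrn14 %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  rename(age = Age, points = Points) %>%
  filter(points > 0)
```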
head(learning2014)
## # A tibble: 6 x 7
## gender age attitude deep stra surf points
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 53 3.7 3.58 3.38 2.58 25
## 2 M 55 3.1 2.92 2.75 3.17 12
## 3 F 49 2.5 3.5 3.62 2.25 24
## 4 M 53 3.5 3.5 3.12 2.25 10
## 5 M 49 3.7 3.67 3.62 2.83 22
## 6 F 38 3.8 4.75 3.62 2.42 21
#describe some variables
as.numeric(learning2014$age)
##   [1] 53 55 49 53 49 38 50 37 37 42 37 34 34 34 35 33 32 44 29 30 27 29 31
## [24] 37 26 26 30 33 33 28 26 27 25 31 20 39 38 24 26 25 30 25 30 48 24 40
## [47] 25 23 25 23 27 25 23 23 23 45 22 23 21 21 21 21 26 25 26 21 23 22 22
## [70] 22 23 22 23 24 22 23 22 20 22 21 22 23 21 22 29 29 21 28 21 30 21 23
## [93] 21 21 20 22 21 21 20 22 20 20 24 20 19 21 21 22 25 21 22 25 20 24 20
## [116] 21 20 20 31 20 22 22 21 22 21 21 21 21 20 21 25 21 24 20 21 20 20 18
## [139] 21 19 21 20 21 20 21 20 18 22 22 24 19 20 20 17 19 20 20 20 20 19 19
## [162] 22 35 18 19 21
summaryvar <- summary(learning2014)
print(summaryvar)
##  gender      age           attitude         deep            stra      
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
library(dplyr)
library(GGally)
library(ggplot2)
library(tidyverse)
library(psych)
#see the correlation between variables
pairs.panels(learning2014)
Each panel shows the relationship between the variable on the horizontal axis and the variable on the vertical axis of the grid. Gender is a discrete variable, so we focus on the continuous variables: age, attitude, deep, stra, surf and points.
For example, the correlation between age and attitude is 0.02. Attitude toward statistics has a positive correlation with age, deep learning and strategic learning, but a negative correlation with surface learning.
The diagonal shows a histogram of each variable, and it can be seen on the graph that points and attitude have a strong correlation (r = 0.44).
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20))
)
Test the association between points and attitude, deep and stra:
cor.test(learning2014$attitude, learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$attitude and learning2014$points
## t = 6.2135, df = 164, p-value = 4.119e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3044463 0.5521335
## sample estimates:
## cor
## 0.4365245
cor.test(learning2014$deep, learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$deep and learning2014$points
## t = -0.12997, df = 164, p-value = 0.8967
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1622192 0.1423931
## sample estimates:
## cor
## -0.01014849
cor.test(learning2014$stra, learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$stra and learning2014$points
## t = 1.8916, df = 164, p-value = 0.06031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.006340149 0.291945961
## sample estimates:
## cor
## 0.1461225
The cor.test results show a correlation between attitude and points of 0.4365245 (p-value = 4.119e-09).
The correlation between deep and points is -0.01014849 (p-value = 0.8967).
The correlation between stra and points is 0.1461225 (p-value = 0.06031).
Looking at the p-values, the associations of points with deep and with stra are not statistically significant, so we remove the deep and stra variables from the model.
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle( "Relation between attitude and points")
print(p4)
The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where \(Y_i\) represents points, \(X_i\) is attitude, \(\alpha\) is the intercept, \(\beta_1\) is the regression coefficient for attitude, and \(\epsilon_i\) is a random error term.
m1 <- lm(points ~ attitude, data = learning2014)
results <- summary(m1)
knitr::kable(results$coefficients, digits = 3, caption = "Regression coefficients")

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | 11.637   | 1.830      | 6.358   | 0          |
| attitude    | 3.525    | 0.567      | 6.214   | 0          |
Both the intercept and attitude are statistically significant predictors. The coefficient of determination \(R^2\) = 0.19, which is not particularly high: attitude alone explains about 19% of the variation in exam points.
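As a quick sanity check of the fitted model, the coefficients from the table above can be used to predict exam points by hand (the attitude value 4 below is just an example):

```r
# Predicted exam points for a student with attitude = 4,
# using the estimates from the regression table above
intercept <- 11.637
beta_attitude <- 3.525
predicted_points <- intercept + beta_attitude * 4
round(predicted_points, 1)  # 25.7
```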
plot(m1, which = c(1, 2, 5))
This dataset describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and the data were collected using school reports and questionnaires. The datasets provide information on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks.
*Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for details). The joined data of the two datasets (n = 382) has 35 different variables.
- school: binary variable: 'GP' (Gabriel Pereira) or 'MS' (Mousinho da Silveira)
- Characteristics of participants: sex, age
- Demographic and family information: address, famsize, Pstatus, Medu, Fedu
- Educational information: absences (numeric; number of school absences), failures (number of past class failures), nursery (attended nursery school; binary: yes or no), internet (Internet access at home; binary: yes or no), guardian (student's guardian; nominal: 'mother', 'father' or 'other')
- G1, G2, G3: grades related to the course subject, Math or Portuguese
I want to see how age (age), sex (sex), being in a romantic relationship (romantic) and the number of school absences (absences) affect alcohol consumption.
The hypothesis is that being older, being male, not being in a romantic relationship, and more days of school absence all increase alcohol consumption.
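The `dat` object used below comes from the wrangling part. A minimal sketch of how the alcohol-use variables are typically derived in this exercise (the column names alc_use and high_use follow the course material and are an assumption here):

```r
library(dplyr)

# dat: the joined Math + Portuguese student data (n = 382)
# Average of weekday (Dalc) and weekend (Walc) alcohol consumption,
# each measured on a 1-5 scale
dat <- mutate(dat, alc_use = (Dalc + Walc) / 2)

# high_use is TRUE when the average consumption is above 2
dat <- mutate(dat, high_use = alc_use > 2)
```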
g1 <- ggplot(data = dat, aes(x = high_use, fill = sex))
g1 + geom_bar()
From plot 1, it can be seen that males make up a higher proportion than females in the high_use group.
g2 <- ggplot(data = dat, aes(x = high_use, fill = romantic))
g2 + geom_bar()
Students not in a romantic relationship make up a somewhat larger proportion of high alcohol consumption than those in a relationship.
g3 <- ggplot(data = dat, aes(x = high_use, y = absences))
g3 + geom_boxplot()
More days of absence from school are associated with higher alcohol consumption.
m <- glm(high_use ~ age + sex + romantic + absences, data = dat, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + romantic + absences, family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7179 -0.8456 -0.6286 1.0823 2.1893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.54258 1.70433 -2.665 0.00769 **
## age 0.16915 0.10226 1.654 0.09810 .
## sexM 0.99159 0.24303 4.080 4.50e-05 ***
## romanticyes -0.29584 0.26585 -1.113 0.26579
## absences 0.07283 0.01828 3.983 6.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 422.16 on 377 degrees of freedom
## AIC: 432.16
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.01064589 0.0003570945 0.290015
## age 1.18430022 0.9701859445 1.450229
## sexM 2.69551085 1.6842578252 4.375275
## romanticyes 0.74390627 0.4371001910 1.242806
## absences 1.07554516 1.0398995842 1.116863
Interpretation of the results:
- Age: being older is associated with increased alcohol use; however, the effect of age is not significant (p-value = 0.10).
- Being male increases the odds of high alcohol use 2.69-fold compared to being female (OR = 2.69); this result is statistically significant.
- Being in a romantic relationship decreases alcohol consumption; however, the effect of this variable is not significant (p-value = 0.26).
- Each additional day of absence from school increases the odds of high alcohol use by a factor of 1.08 (95% CI = 1.04-1.12).
Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.
probabilities <- predict(m, type = "response")
dat <- mutate(dat, probability = probabilities)
dat <- mutate(dat, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = dat$high_use, prediction = dat$prediction)
##         prediction
## high_use FALSE TRUE
## FALSE 261 9
## TRUE 88 24
So the model misclassifies (9 + 88) / 382 ≈ 25.4% of the individuals; the training error is about 0.25. A simple guessing strategy that always predicts FALSE would misclassify 112 / 382 ≈ 29.3%, so the model performs only slightly better than guessing.
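The training error can be read straight off the cross-tabulation above; a self-contained check using the printed counts:

```r
# Confusion matrix copied from the output above:
# rows = actual high_use, columns = predicted
cm <- matrix(c(261,  9,
                88, 24),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

# Off-diagonal cells are the misclassified individuals
training_error <- (cm["FALSE", "TRUE"] + cm["TRUE", "FALSE"]) / sum(cm)
round(training_error, 3)  # 0.254
```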
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
library(boot)
cv <- cv.glm(data = dat, cost = loss_func, glmfit = m, K = nrow(dat))
summary(cv)
##       Length Class  Mode
## call 5 -none- call
## K 1 -none- numeric
## delta 2 -none- numeric
## seed 626 -none- numeric
library(MASS)
library(tidyverse)
library(corrplot)
library(plotly)
#Load dataset Boston from MASS
data("Boston")
#look at data Boston
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
#look at the structure of dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Data summary: this dataset is about housing values in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns, with the variables explained below:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
## crim zn indus chas
## Min. :-0.3900 Min. :-0.5700 Min. :-0.7100 Min. :-0.12000
## 1st Qu.:-0.2150 1st Qu.:-0.4050 1st Qu.:-0.3825 1st Qu.:-0.04750
## Median : 0.3200 Median :-0.2550 Median : 0.3950 Median : 0.02000
## Mean : 0.1786 Mean :-0.0550 Mean : 0.1929 Mean : 0.08143
## 3rd Qu.: 0.4500 3rd Qu.: 0.2775 3rd Qu.: 0.6300 3rd Qu.: 0.09000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## nox rm age dis
## Min. :-0.770 Min. :-0.61000 Min. :-0.7500 Min. :-0.7700
## 1st Qu.:-0.360 1st Qu.:-0.29750 1st Qu.:-0.2625 1st Qu.:-0.5225
## Median : 0.305 Median :-0.21500 Median : 0.3050 Median :-0.3050
## Mean : 0.190 Mean :-0.01286 Mean : 0.1736 Mean :-0.1464
## 3rd Qu.: 0.655 3rd Qu.: 0.19000 3rd Qu.: 0.5775 3rd Qu.: 0.2400
## Max. : 1.000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## rad tax ptratio black
## Min. :-0.4900 Min. :-0.5300 Min. :-0.5100 Min. :-0.44000
## 1st Qu.:-0.2850 1st Qu.:-0.3050 1st Qu.:-0.2175 1st Qu.:-0.37750
## Median : 0.4600 Median : 0.4850 Median : 0.2250 Median :-0.22500
## Mean : 0.2371 Mean : 0.2364 Mean : 0.1157 Mean :-0.06071
## 3rd Qu.: 0.6075 3rd Qu.: 0.6475 3rd Qu.: 0.3775 3rd Qu.: 0.16750
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## lstat medv
## Min. :-0.7400 Min. :-0.74000
## 1st Qu.:-0.4000 1st Qu.:-0.46000
## Median : 0.4150 Median :-0.38000
## Mean : 0.1407 Mean :-0.06857
## 3rd Qu.: 0.5775 3rd Qu.: 0.31000
## Max. : 1.0000 Max. : 1.00000
Fig. 1 shows the distributions of the variables and their pairwise relationships. For example, chas is a categorical variable.
#scale variables
boston_scaled <- scale(Boston)
summary(boston_scaled %>% round(digits = 2))
##       crim                 zn               indus
## Min. :-0.420000 Min. :-0.490000 Min. :-1.560000
## 1st Qu.:-0.410000 1st Qu.:-0.490000 1st Qu.:-0.870000
## Median :-0.390000 Median :-0.490000 Median :-0.210000
## Mean :-0.000119 Mean :-0.002154 Mean :-0.001304
## 3rd Qu.: 0.010000 3rd Qu.: 0.050000 3rd Qu.: 1.010000
## Max. : 9.920000 Max. : 3.800000 Max. : 2.420000
## chas nox rm
## Min. :-0.270000 Min. :-1.46e+00 Min. :-3.880000
## 1st Qu.:-0.270000 1st Qu.:-9.10e-01 1st Qu.:-0.570000
## Median :-0.270000 Median :-1.40e-01 Median :-0.110000
## Mean : 0.001838 Mean : 5.93e-05 Mean : 0.000138
## 3rd Qu.:-0.270000 3rd Qu.: 6.00e-01 3rd Qu.: 0.480000
## Max. : 3.660000 Max. : 2.73e+00 Max. : 3.550000
## age dis rad
## Min. :-2.330000 Min. :-1.270000 Min. :-9.80e-01
## 1st Qu.:-0.837500 1st Qu.:-0.800000 1st Qu.:-6.40e-01
## Median : 0.315000 Median :-0.280000 Median :-5.20e-01
## Mean : 0.000415 Mean :-0.000138 Mean : 5.93e-05
## 3rd Qu.: 0.907500 3rd Qu.: 0.660000 3rd Qu.: 1.66e+00
## Max. : 1.120000 Max. : 3.960000 Max. : 1.66e+00
## tax ptratio black
## Min. :-1.3100000 Min. :-2.70000 Min. :-3.900000
## 1st Qu.:-0.7700000 1st Qu.:-0.49000 1st Qu.: 0.202500
## Median :-0.4600000 Median : 0.27500 Median : 0.380000
## Mean : 0.0001383 Mean : 0.00164 Mean :-0.000079
## 3rd Qu.: 1.5300000 3rd Qu.: 0.81000 3rd Qu.: 0.430000
## Max. : 1.8000000 Max. : 1.64000 Max. : 0.440000
## lstat medv
## Min. :-1.53000 Min. :-1.9100000
## 1st Qu.:-0.79750 1st Qu.:-0.5975000
## Median :-0.18000 Median :-0.1400000
## Mean : 0.00002 Mean : 0.0000988
## 3rd Qu.: 0.60000 3rd Qu.: 0.2700000
## Max. : 3.55000 Max. : 2.9900000
#to see class of the scaled object
class(boston_scaled)
## [1] "matrix"
#change boston_scaled from matrix into data frame
boston_scaled <-as.data.frame(boston_scaled)
summary(Boston$crim)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
#create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
#see quantile parts of bins
bins
##           0%          25%          50%          75%         100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low","med_high", "high") , include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
After scaling, the mean of every variable moves to 0. The crime variable is now split into 4 categories according to how high the per capita crime rate by town is.
# linear discriminant analysis with target variable crime
lda_crime <- lda(crime ~ ., data = train)
lda_crime
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2376238 0.2549505 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.9234977 -0.9324407 -0.12514775 -0.8537501 0.45119852
## med_low -0.1029092 -0.3057434 -0.06727176 -0.5824053 -0.21459798
## med_high -0.3939928 0.2197965 0.22458650 0.3433826 0.04769488
## high -0.4872402 1.0171960 -0.07145661 1.0334512 -0.43823283
## age dis rad tax ptratio
## low -0.8604835 0.8736358 -0.6909978 -0.7275014 -0.4819447
## med_low -0.4551289 0.4030781 -0.5428218 -0.5174903 -0.1088905
## med_high 0.3309676 -0.3459783 -0.3942575 -0.2827545 -0.2346298
## high 0.8210575 -0.8686206 1.6373367 1.5134896 0.7798552
## black lstat medv
## low 0.37751953 -0.768332284 0.50756501
## med_low 0.34775972 -0.105294907 -0.02044281
## med_high 0.05596341 0.002548486 0.14284884
## high -0.85123092 0.934863303 -0.69466744
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07759509 0.71610089 -0.90410288
## indus 0.01994827 -0.35605564 0.40148476
## chas -0.08160859 -0.10305010 0.04319537
## nox 0.38057367 -0.73304732 -1.25425587
## rm -0.07539351 -0.03221574 -0.21028415
## age 0.25731820 -0.18853468 -0.35134912
## dis -0.07614425 -0.30196649 0.21210278
## rad 2.87052084 1.07532814 0.26861688
## tax 0.07069853 -0.13953718 0.19929002
## ptratio 0.12561166 -0.04162764 -0.33171216
## black -0.15675438 0.01380473 0.18639181
## lstat 0.22706481 -0.23118829 0.48235460
## medv 0.19979188 -0.44504120 -0.16903215
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9448 0.0394 0.0158
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
# plot lda with 2 discriminants
plot(lda_crime, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_crime, myscale = 1)
# eigenvalues:
lda_crime$svd
## [1] 39.351787  8.040404  5.081920
The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (here, the Boston variables) that give the best possible separation between the groups.
We separated crime into 4 groups: "low", "med_low", "med_high", "high", so the number of groups is G = 4, and the number of predictor variables is p = 13. The maximum number of useful discriminant functions is the minimum of G - 1 and p, so the result contains 3 discriminant functions (proportion of trace: LD1, LD2, LD3).
The higher the ratio of the between-group to within-group standard deviation on a linear discriminant variable, the better that discriminant separates the groups. The proportion of trace for LD1 is 0.94, meaning the first discriminant accounts for about 94% of the between-group variance.
In the plot above we can see the linear combination of the variables that separates the crime classes.
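The proportion-of-trace values can be recovered from the singular values printed by `lda_crime$svd` above: each proportion is a squared singular value divided by the sum of all squared singular values.

```r
# Singular values printed by lda_crime$svd above
svd_vals <- c(39.351787, 8.040404, 5.081920)

# Proportion of between-group variance captured by each discriminant
prop_trace <- svd_vals^2 / sum(svd_vals^2)
round(prop_trace, 4)  # 0.9448 0.0394 0.0158
```

These match the "Proportion of trace" row in the lda() output.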
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda_crime, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct low med_low med_high high
## low 14 4 2 0
## med_low 8 11 11 0
## med_high 0 2 21 0
## high 0 0 0 29
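The overall prediction accuracy can be computed from the cross-tabulation printed above (note that the train/test split is random, so these exact counts vary between runs):

```r
# Confusion matrix copied from the output above
conf <- matrix(c(14,  4,  2,  0,
                  8, 11, 11,  0,
                  0,  2, 21,  0,
                  0,  0,  0, 29),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Correct predictions lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)  # 0.74
```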
The table above shows how the model predicts the crime classes. The model predicts the "high" and "low" classes most accurately.
Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
dist_man <- dist(Boston, method = 'manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
# k-means clustering
km <-kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
#K-means: determine the number of clusters k
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
According to the plot above, the optimal number of clusters is around 2, because that is where the total within-cluster sum of squares (WCSS) drops most sharply.
km <-kmeans(Boston, centers = 2)
#plot focus on some variables
pairs(Boston, col = km$cluster)
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:14], col = km$cluster)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda_crime$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_crime$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)
library (ggplot2)
library (tidyr)
library (stringr)
library (dplyr)
library (corrplot)
library(GGally)
library(stringr)
library(psych)
#read data
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", header = T)
#look at the first six rows
head(human)
##               Edu2.FM   Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6 6.2
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7 3.8
## Parli.F
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
names(human)
## [1] "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
str(human)
## 'data.frame':    155 obs. of  8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
human$GNI <- str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric()
str(human$GNI)
## num [1:155] 64992 42261 56431 44025 45435 ...
#plot to see correlation between variables in data
fig1 <- pairs(human, pch = 21, bg = c("red", "green3", "blue", "purple", "violet", "yellow", "orange", "grey"))
fig2 <- cor(human) %>% corrplot.mixed()
Variable explanations: the dataset has 155 observations and 8 variables.
- Edu2.FM: ratio of females to males with at least secondary education
- Labo.FM: ratio of female to male labour force participation
- Edu.Exp: expected years of education
- Life.Exp: life expectancy at birth
- GNI: gross national income per capita
- Mat.Mor: maternal mortality ratio
- Ado.Birth: adolescent birth rate
- Parli.F: percentage of female representatives in parliament
From fig1 and fig2 we can see the correlations between the variables.
There seem to be strong positive correlations, e.g. between Edu.Exp and Life.Exp, and between Mat.Mor and Ado.Birth.
There are also negative correlations, e.g. between Mat.Mor and Life.Exp.
pca_human.notstand <- prcomp(human)
summary1<- summary (pca_human.notstand)
print(summary1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*summary1$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human.notstand, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
From the unstandardized PCA we see that the components cannot be meaningfully separated: GNI has a much larger scale than the other variables, so PC1 captures essentially all of the variance.
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary2<- summary(pca_human_std)
print (summary2)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca_pr <- round(100*summary2$importance[2, ], digits = 1)
pca_pr## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human_std, cex = c(0.8, 1), col = c("orange", "green4"), xlab = pc_lab[1], ylab = pc_lab[2])
In the standardized model the first component (PC1) explains 53.6% of the variance and the second component (PC2) explains 16.2% of the variance. PC3 explains 9.6%, and so on.
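To see which variables drive each component, the loadings can be inspected from the rotation matrix of the fitted prcomp object (a short sketch; it assumes the pca_human_std object fitted above is available):

```r
# Loadings of the first two principal components;
# large absolute values indicate the variables that drive each component
round(pca_human_std$rotation[, 1:2], 2)
```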
Install the FactoMineR package first:
install.packages("FactoMineR")
library(FactoMineR)
library (ggplot2)
library (tidyr)
library (dplyr)
library (corrplot)
library(GGally)
library(stringr)
library (psych)
#calling data tea
data(tea)
#structure of data tea
str(tea)
## 'data.frame':    300 obs. of  36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# dimensions of the data
dim(tea)
## [1] 300 36
# tea includes 300 observations of 36 variables
# visualize the first 10 variables
fig5 <- pairs(tea[, 1:10])
# fig5 shows the pairwise plots of the first 10 variables; since these are
# binary factors, no clear linear relationships appear between them
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
head(tea_time)
## Tea How how sugar where lunch
## 1 black alone tea bag sugar chain store Not.lunch
## 2 black milk tea bag No.sugar chain store Not.lunch
## 3 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 4 Earl Grey alone tea bag sugar chain store Not.lunch
## 5 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 6 Earl Grey alone tea bag No.sugar chain store Not.lunch
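The number of MCA dimensions is fixed by the category counts: with J categories over Q variables, MCA yields J − Q dimensions. A quick check on `tea_time` (3 + 4 + 3 + 2 + 3 + 2 = 17 levels over 6 variables) explains the 11 dimensions reported by the MCA summary:

```r
# count the category levels of each factor
sapply(tea_time, nlevels)
# J - Q gives the number of MCA dimensions: 17 - 6 = 11
sum(sapply(tea_time, nlevels)) - ncol(tea_time)
```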
gather(tea_time) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
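The eigenvalue table can also be accessed programmatically: in FactoMineR the `mca$eig` matrix holds the eigenvalues together with the percentage and cumulative percentage of variance per dimension, which is handy for deciding how many dimensions to keep (a sketch reusing the `mca` object above):

```r
# eigenvalues and explained variance per dimension
head(mca$eig)
# first dimension at which the cumulative variance reaches, say, 50 %
which(mca$eig[, "cumulative percentage of variance"] >= 50)[1]
```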
par(mfrow = c(1, 2))
plot(mca, invisible = c("ind"), title = "before")
plot(mca, invisible = c("ind"), habillage = "quali", title = "after")
The plots above show the global pattern within the data. Individuals (rows) are represented by blue points and variable categories (columns) by red triangles.
The distance between row points or between column points measures their similarity (or dissimilarity): rows with similar profiles lie close together on the factor map, and the same holds for columns.
***